This report is a portion of the AMIA 2015 Tutorial on Using R for Healthcare Data Science. All code and data are available on my GitHub page.
This report will walk you through the data scientist’s workflow and show how recent R packages make data science easier and more intuitive. First, let’s start with a couple of disclaimers:
To illustrate how packages released over the past few years have made these tasks easier, we will walk through an entire analysis plan using data published by the International Warfarin Pharmacogenomics Consortium (IWPC), available on the PharmGKB website.
Taking a cue from David Robinson [1], Data Scientist at Stack Overflow, and Philip Guo [2], Assistant Professor of Computer Science at University of Rochester, here is my view of the primary computational data science workflow:
The preparation phase of the workflow involves:
The analysis phase consists of:
Finally, the dissemination phase, in which data scientists share the results of their work:
Over the past few years the growth in tools aiding these steps has been phenomenal. We will cover each of these as we move through the workflow steps, but here is a summary of the different packages I’ve found useful for these steps:
Let’s load up our IWPC data! We will be using a slightly modified form of the main data set, which I have manually converted into a tab-delimited text file. Although there are a number of libraries for reading Excel files, the non-standard column names in the data set make it easier to work with a TSV. We are going to use read.delim() as opposed to readr’s read_tsv() for two reasons:
This last reason is the deal breaker for readr. readr infers each column’s type (character, date, number, etc.) from the first 100 rows unless the types are specified manually. Given the large number of columns (78), manual specification becomes annoying at best. However, since we can’t take advantage of readr automatically creating a tbl_df object, we will have to do so manually.
library(dplyr)  # provides %>% and tbl_df()
iwpc_data <- read.delim(file = "iwpc_data_7_3_09_revised3.txt") %>% tbl_df()
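For comparison, here is a minimal sketch of what the manual specification route with readr looks like. It runs on a tiny made-up file rather than the IWPC data: the column names (subject_id, age, inr_on_dose) and values are hypothetical and chosen only to show the mechanics of col_types.

```r
library(readr)

# Hypothetical three-column file standing in for the real 78-column data set
tsv_path <- tempfile(fileext = ".tsv")
writeLines(
  c("subject_id\tage\tinr_on_dose",
    "S1\t50 - 59\t2.5",
    "S2\t60 - 69\tNA"),
  tsv_path
)

# Declare every column type up front instead of relying on readr's guess
toy <- read_tsv(
  tsv_path,
  col_types = cols(
    subject_id  = col_character(),
    age         = col_character(),  # age is binned, so keep it as text
    inr_on_dose = col_double()
  )
)

toy
```

With three columns this is painless; writing out 78 entries by hand is the annoyance described above. A shortcut such as cols(.default = col_character()) reads everything as text, at the cost of converting the numeric columns afterwards.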
Let’s take a look at the type of data we are working with.
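One way to inspect the column types is dplyr’s glimpse(), optionally followed by a tally of column classes. The sketch below uses a small invented data frame in place of iwpc_data, so the column names and values are assumptions for illustration only. It sets stringsAsFactors = TRUE to mimic what read.delim() did by default in versions of R before 4.0.

```r
library(dplyr)

# Toy stand-in for iwpc_data; names and values are invented.
# as_tibble() is the current name for what tbl_df() did.
toy <- data.frame(
  Subject.ID  = c("S1", "S2", "S3"),
  Age         = c("50 - 59", "60 - 69", NA),
  Height..cm. = c(170.2, NA, 158.0),
  stringsAsFactors = TRUE
) %>% as_tibble()

# Compact, one-line-per-column view of names, types, and first values
glimpse(toy)

# Tally how many columns fall into each class
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
table(col_classes)
```

Running the same two calls on iwpc_data gives a quick sense of how many columns came in as factors versus numbers before any cleaning.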
Looking at our data above we see there are a number of problems: